AI safety
AI Security
AI事業者ガイドライン
技術未来予想
LLM safety
Deep Fake
SRE
Mechanistic Interpretability for AI Safety -- A Review
https://arxiv.org/abs/2404.14082
大規模言語モデルにおける安全性の実現と方向性
https://llmc.nii.ac.jp/wp-content/uploads/2024/10/20240925_t4_sekine.pdf
Robust Intelligence
https://www.robustintelligence.com/
citadel AI
https://www.citadel.co.jp/
渋谷の牛タン屋で横にいたカップルとAI開発における演繹と帰納について
https://storialaw.jp/blog/4532
ChatGPT vs BERT:どちらが日本語をより理解できるのか?
https://fintan.jp/page/9126/
オープンソースLLMの日本語評価結果 - W&Bローンチで誰でも再現可能に
https://note.com/wandb_jp/n/n2464e3d85c1a
lm-evaluation-harness
https://github.com/EleutherAI/lm-evaluation-harness
第95回 Machine Learning 15minutes! Hybrid 切り抜き
https://www.youtube.com/watch?v=w8M7DRVOR54
「AI Safety の必要性と具体的な攻撃、その対策について」松尾研 LLM コミュニティ "Paper & Hacks Vol.30"
https://www.youtube.com/watch?v=ji1G90kUel8
「AI Safety の必要性と具体的な攻撃、その対策について」
https://www.youtube.com/watch?v=ji1G90kUel8
HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection
https://arxiv.org/abs/2409.17504
LLM Guard - The Security Toolkit for LLM Interactions
https://github.com/protectai/llm-guard
HALoGEN: Fantastic LLM Hallucinations and Where to Find Them
https://arxiv.org/abs/2501.08292
GuardReasoner: Towards Reasoning-based LLM Safeguards
https://arxiv.org/abs/2501.18492
OpenAIのModeration APIを利用してAI彼女を性的被害から守る
https://pixel-freak.com/blog/openai-moderation-api
NeMo Framework で実践する継続事前学習 – 日本語 LLM 編 –
https://developer.nvidia.com/ja-jp/blog/how-to-use-continual-pre-training-with-japanese-language-on-nemo-framework
COLING 2025 Tutorial: Safety Issues for Generative AI
https://librairesearch.github.io/tutorial/static/slides/Safety_Issues_of_GenAI.pdf
https://librairesearch.github.io/tutorial/index.html
AIセーフティ年次レポート2024
https://aisi.go.jp/effort/effort_information/250207_3/
OpenAIのModeration API
https://weel.co.jp/media/moderation-api/
OWASP(Open Web Application Security Project)について
https://zenn.dev/mukkun69n/articles/3f6e689d3cfa87
Jailbreak で遊べるゲーム AILBREAK を開発しました
https://note.com/schroneko/n/n3c8ce016a38b
ASI existential risk: reconsidering alignment as a goal
https://michaelnotebook.com/xriskbrief/index.html
Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment
https://arxiv.org/abs/2504.15585
Start your Trustworthy AI Development with Safety Leaderboards in Azure AI Foundry
https://techcommunity.microsoft.com/blog/aiplatformblog/start-your-trustworthy-ai-development-with-safety-leaderboards-in-azure-ai-found/4425165
How much novel security-critical infrastructure do you need during the singularity?
https://www.alignmentforum.org/posts/qKz2hBahahmb4uDty/how-much-novel-security-critical-infrastructure-do-you-need
An Approach to Technical AGI Safety and Security
https://arxiv.org/abs/2504.01849
Six Thoughts On AI Safety
https://windowsontheory.org/2025/01/24/six-thoughts-on-ai-safety/
From Shift Left to Shift Up: Securing the New AI Abstraction Layer
https://www.pillar.security/blog/from-shift-left-to-shift-up-securing-the-new-ai-abstraction-layer
Building and evaluating alignment auditing agents
https://alignment.anthropic.com/2025/automated-auditing/